Bulletpapers - Understand complex papers in seconds

May 2024

Creating 3D Scenes from Images

CAT3D is a two-step method that uses a multi-view diffusion model to generate consistent novel views of a scene from input images. These views are fed into a robust 3D reconstruction pipeline to create a 3D representation that can be rendered from any viewpoint.

May 2024

Simulation tools for customizable computer vision data

This paper introduces a simulation toolkit that generates customizable synthetic data for systematically evaluating computer vision models. It supports adjusting parameters such as scene layout, lighting, object models and poses, and camera settings to create controlled experiments. Example uses include assessing model robustness and capabilities on the same image...

May 2024

Glass boundary detection for segmentation

This paper proposes a new deep learning approach to glass segmentation that focuses on detecting glass boundaries and avoiding over-capturing spurious features. A wide, shallow network architecture is used to extract large-scale glass regions while also embedding strong boundary constraints. Attention mechanisms filter noise and supplement detail within segmented region...

May 2024

Strategic image generation for large datasets

This paper presents a framework to generate synthetic training images for large image datasets. It uses a curriculum strategy that transitions image generation from simple to complex patterns. This addresses issues in prior work of generating repetitive, simplistic images. The framework incorporates curriculum evaluation and adversarial optimization to improve diversity...

May 2024

Categorizing and Tagging Indian Folk Paintings

This paper presents a new dataset and approach to accurately classify Indian folk paintings into 12 distinct styles, as well as generate descriptive tags for each painting. A dataset of 2279 images was compiled, with 30 tags per image describing colors, themes, objects, etc. Fine-tuned CNN models achieved 91.83% accuracy in classification, while generated tags provide d...

May 2024

Tracking Skin Features with Deep Neural Networks

This paper proposes an unsupervised deep learning method to track skin features like moles across video frames, for applications in ballistocardiography heart rate measurement and quantifying degradation in Parkinson's disease. It trains a convolutional autoencoder on unlabeled facial images to create deep encodings tailored to skin features. Adding a Gaussian weight di...

May 2024

Video motion editing with consistent object content

The paper proposes a method to edit the motion of objects in a source video to match a reference video, while preserving the object's appearance and background. It uses a two-stage training strategy to separately learn object content features and motion features, enabling precise control over motion while maintaining content consistency.

May 2024

Mamba's potential in computer vision

This paper surveys recent work applying the Mamba architecture, an efficient sequence modeling technique, to computer vision tasks. Mamba shows promise in areas like image classification, segmentation, reconstruction and more due to its linear complexity and ability to capture long-range dependencies. The survey categorizes Mamba variants by application area and data ty...

May 2024

Language Models Improve Pose Estimation

This paper presents a method that uses large language models to refine 3D human pose estimates by generating natural language descriptions of physical contacts from images. These descriptions are converted into optimization constraints to capture semantics like hugs, hand-holding, and yoga poses. Without extra training data, the method performs comparably to more comple...

May 2024

Reconstructing hand and object shapes from sparse camera views

This paper proposes a method to reconstruct 3D hand and object shapes from only a few RGB camera views, using neural networks. It is challenging to reconstruct accurate and detailed 3D shapes from images, especially with occlusion. While single-view methods can generalize to new objects, they struggle with occlusion and accuracy. Dense multi-view methods are accurate bu...

May 2024

Detecting occluded pedestrians via feature completion

This paper proposes a method to improve pedestrian detection for occluded people in images. It completes missing visual features for occluded body parts by borrowing features from fully visible pedestrian examples. This aligns features between occluded and visible pedestrians to improve classification. The method locates occluded regions using feature correlations betwe...

May 2024

Self-supervised image denoising with distorted inputs

This paper analyzes a self-supervised algorithm for image denoising that can handle distorted or 'denatured' inputs. Both theoretical analysis and experiments are used to evaluate performance. Key findings show the algorithm can find good solutions for population risk, but performance on test data depends on the difficulty of the distortions. Overall this suggests the a...

May 2024

Enhancing person re-identification with uncertainty feature fusion and wise distance aggregation

This paper presents two new methods to improve person re-identification in surveillance video - uncertainty feature fusion (UFFM) and wise distance aggregation (WDA). UFFM synthesizes features from multiple images of a person to create a robust representation across different views. WDA intelligently combines multiple similarity metrics to better distinguish between peo...

April 2024

Descriptions of images capturing key challenges

The DOCCI dataset contains 15,000 images with long, detailed English descriptions annotated by humans. The images and descriptions focus on assessing critical limitations in current text-to-image models, including spatial relationships, counting, text rendering, world knowledge, and more. The descriptions distinguish each image from highly similar ones.

April 2024

Vision-based drone detection

This paper proposes a new approach for detecting drones in videos captured by drones. It uses a coarse-to-fine strategy with vision transformer networks to handle challenges like small target sizes, distortion, and real-time processing requirements. The method achieves significant improvements in detection accuracy over prior methods on three drone datasets. It also dem...

April 2024

Mobile robotic arm with visual perception

This paper introduces a novel robotic grasping system combining a visual foundation model called Segment Anything Model (SAM) with a robotic arm mounted on a mobile platform. Key advantages are SAM's versatility in segmenting diverse objects, an "eye-in-hand" depth camera enabling precise closed-loop control, and the mobile platform expanding the operational range. Toge...

April 2024

Monocular depth estimation challenge tests generalization

The paper summarizes the third Monocular Depth Estimation Challenge, which tested how well algorithms generalize to complex natural and indoor scenes. Nineteen submissions outperformed the baseline method, and 10 teams submitted reports, which show widespread use of the Depth Anything model. The top method increased the 3D F-Score from 17.51% to 23.72%.

April 2024

Bridging open-source and commercial multimodal models

This paper introduces InternVL 1.5, an open-source multimodal model that aims to match proprietary counterparts in capabilities. It does so through 3 key improvements: a reusable vision encoder, dynamic high resolution, and a bilingual dataset. When evaluated on 18 benchmarks, it achieved state-of-the-art results on 8, showing it has narrowed the gap.

April 2024

3D medical vision-language pretraining with CT scans

This paper introduces CT-GLIP, a novel 3D medical vision-language pretraining method using CT scans and radiology reports. It constructs organ-level image-text pairs and an abnormality dictionary to enhance multimodal contrastive learning. When trained on 44,011 CT scan and report pairs across 104 organs and 17,702 patients, CT-GLIP demonstrates superior zero-shot organ...

April 2024

Realistic 3D talking head synthesis

This paper introduces a new method called TalkingGaussian that represents facial motions by applying smooth deformations to persistent 3D Gaussian primitives. This simplifies the learning task compared to previous methods that directly modify point appearance, avoiding distortions in dynamic regions. It also decomposes the model into separate face and mouth branches, fu...

April 2024

Learning facial expressions from limited labels

This paper proposes a semi-supervised framework called LEAF that improves facial expression recognition when labeled data is scarce. It enhances both the quality of representations and pseudo-labels. LEAF introduces hierarchical strategies to focus on expression-relevant information in features and predictions. It also assigns ambiguous pseudo-labels and enforces consis...

April 2024

Discovering visual circuits in deep networks

This paper introduces a method to automatically extract subgraphs from vision models that implement recognition of specific visual concepts. By tracing interdependent neuron activations across layers on a few example inputs, it identifies functional circuits underlying concepts that causally impact model outputs. Editing these circuits can defend models against adversar...

April 2024

Generalizable Rendering of Humans from Video

This paper introduces a novel method called the Generalizable Neural Human Renderer (GNH) to create high-quality, animatable 3D renderings of people from monocular video inputs, without needing test-time optimization for each new person. GNH focuses on efficiently transferring visual information to the output image using body shape priors and multi-view geometry. It ach...

April 2024

Decoupling SAM for robotic surgery tool segmentation

This paper proposes Surgical-DeSAM, an approach that combines object detection and segmentation models to enable real-time robotic surgery tool segmentation without manual prompting. It utilizes Swin Transformer for detection and a decoupled Segment Anything Model for segmentation, outperforming prior methods on surgery datasets.

April 2024

Multimodal models challenged by core visual perception

This paper introduces Blink, a benchmark evaluating key visual perception skills in multimodal language models. It finds that while humans can solve these visual tasks easily, state-of-the-art models still struggle significantly. Specialized computer vision models perform much better, suggesting potential pathways to improve multimodal models.

April 2024

Detail-rich video upscaling

This paper introduces VideoGigaGAN, a new approach to video super-resolution that can generate high-resolution videos with significantly more fine-grained details compared to previous methods, while still maintaining good temporal consistency across frames. It builds on top of GigaGAN, a powerful image super-resolution model, and identifies key challenges in adapting th...

April 2024

Self-supervised segmentation of visual entities

This paper presents a novel computer vision approach called SOHES that can segment visual entities in images without needing manual image annotations. It works in three main phases: generating high-quality pseudo-labels from unlabeled images, training a model on those pseudo-labels, and refining the model's predictions. A key capability is segmenting not just whole obje...

April 2024

Reconstructing Driving Scenes from Vehicle Images

This paper introduces an efficient method called 6Img-to-3D that can reconstruct interactive 3D representations of outdoor driving scenes using only six images from a vehicle's outward-facing cameras. It combines attention mechanisms, differentiable rendering, and other techniques to output a parameterized 'triplane' from which novel views can be rendered. The method is...

April 2024

Standard image and video coding harms deep vision model performance

This paper analyzes how standardized image and video compression codecs like JPEG and H.264 impact the accuracy of deep learning models across vision tasks. The authors find significant performance degradation, especially for dense prediction tasks under high compression. The analysis goes beyond prior work to cover localization and segmentation in addition to classifica...

April 2024

Scene context helps identify objects without visual information

This paper investigates how visual scene context facilitates object identification in images, even when visual information about the target object itself is obscured or missing. The authors train neural Referring Expression Generation models on images where target objects are replaced with noise, evaluating model predictions and attention patterns. Results indicate that...

April 2024

Efficient multimodal fusion for healthcare and beyond

This paper introduces an innovative process model for combining data from multiple sources like images, text, and tabular data. It leverages techniques like embeddings and foundational models to reduce complexity and bias. The model is versatile and computationally efficient, making it suitable for healthcare applications with scarce resources. Its effectiveness is demo...

April 2024

Enhancing Gait Analysis from Low Quality Videos

This paper proposes a method to improve gait analysis on low-quality surveillance videos by using an artifact correction model before pose estimation, avoiding issues with fine-tuning pose estimators directly. Their approach trains an artifact correction model optimized specifically to enable better pose estimation, without modifying the pose model itself. Experiments s...

April 2024

Benchmarking vision models for semantic image segmentation

This paper studies how to effectively benchmark vision foundation models for the task of semantic image segmentation. It analyzes the impact of various benchmark settings on performance rankings and training efficiency. The key recommendations are to fine-tune ViT-B models with a 16x16 patch size and linear decoder, use multiple datasets, and avoid linear probing. This ...

April 2024

Annotation-free object detection using CLIP and SAM

This paper proposes Zip, a method that combines CLIP and SAM models in a classification-first-then-discovery pipeline for annotation-free object detection and instance segmentation. It finds that clustering intermediate CLIP features provides strong object boundary information to distinguish individual objects. Zip boosts SAM's performance on COCO by over 10% mask AP wi...

April 2024

Recovering materials from images using diffusion priors

This paper proposes a method to estimate object materials like albedo and roughness from images captured under unknown lighting. It uses conditional diffusion models trained on large 3D datasets to learn priors over albedo and specular shading. These priors help resolve ambiguities during optimization-based inverse rendering. A coarse-to-fine strategy further refines th...

April 2024

Short-form UGC video quality assessment challenge

This paper reviews the NTIRE 2024 Challenge focused on evaluating video quality assessment methods tailored for short user-generated content videos from the Kwai platform. The challenge dataset contains over 4000 videos annotated with quality scores. 13 teams participated, proposing solutions that advanced the state-of-the-art in quality assessment for this genre of vid...

April 2024

Reference-guided inpainting of 3D scenes

This paper proposes RefFusion, a novel approach for controllable 3D scene inpainting. It leverages an image inpainting diffusion model personalized to a reference view of the scene. This adaptation reduces variance in the score distillation process used to optimize the 3D scene, enabling significantly sharper details in the inpainted regions. RefFusion achieves state-of...

April 2024

Detecting unknown biases in text-to-image models

This paper proposes OpenBias, an automatic pipeline to detect and quantify biases in text-to-image models without needing a predefined list of biases. It has 3 stages: first, a language model proposes possible biases for a set of captions. Then, a text-to-image model generates images from those captions. Finally, a vision question answering model recognizes if those pro...

April 2024

Photometric stereo for complex shapes and materials

This paper proposes a deep learning method called RMAFF-PSN to improve photometric stereo reconstruction of complex surface regions with varying materials. It extracts multi-scale features across shallow and deep network layers and uses attention mechanisms for optimization. Experiments show improved accuracy compared to prior methods, especially for intricate geometrie...

April 2024

Obstacle detection using LiDAR and point cloud processing

This paper presents a pipeline for detecting obstacles in LiDAR point cloud data using voxel grids, RANSAC, and clustering methods. The pipeline is designed to run efficiently on computationally constrained devices. It was tested on a robot with a LiDAR sensor and detected obstacles in real time as the robot moved through a lab environment.
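
The voxel-grid stage mentioned above can be illustrated with a minimal sketch. It assumes the common form of voxel downsampling, in which every point that falls into the same cubic cell is replaced by the cell's centroid. The function name `voxel_downsample`, the `voxel_size` parameter, and the toy cloud are hypothetical and not taken from the paper; the RANSAC and clustering stages are not shown.

```python
from collections import defaultdict
from math import floor


def voxel_downsample(points, voxel_size):
    """Replace all points that share a voxel with their centroid.

    Illustrative only: names and parameters are assumptions,
    not the paper's implementation.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")

    # Group points by the integer index of the voxel they fall into.
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        buckets[key].append((x, y, z))

    # Emit one centroid per occupied voxel, in a deterministic order.
    centroids = []
    for key in sorted(buckets):
        pts = buckets[key]
        n = len(pts)
        centroids.append(tuple(sum(p[i] for p in pts) / n for i in range(3)))
    return centroids


# Toy cloud: the first two points share a voxel, the third does not.
cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.1, 0.1)]
reduced = voxel_downsample(cloud, voxel_size=1.0)
```

Reducing the cloud this way before plane fitting and clustering is one plausible reason a pipeline like this can run on constrained hardware, since later stages see far fewer points.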

April 2024

Broadening visual understanding in AI models

This paper proposes a method to combine multiple vision systems in AI models, to create more robust visual understanding. It shows that using diverse visual features leads to state-of-the-art performance on vision-language tasks, while also reducing issues like visual hallucinations.

April 2024

Accelerator for computer vision with graph neural networks

This paper introduces an FPGA-based hardware and software system called GCV-Turbo to accelerate computer vision tasks that use graph neural networks. It consists of a hardware architecture optimized to execute computations for both convolutional and graph neural networks, and a compiler that optimizes across layers. Evaluated on tasks with images, skeletons and point cl...

April 2024

Reconstructing Objects Held by Hands

This paper presents a method to reconstruct 3D models of common household objects held in people's hands, using only a single photo as input. The task is challenging because the hand occludes much of the object. The key idea is to leverage recent advances in AI to detect the object and find a matching 3D model from a database. Experiments show this approach reconstructs ...

April 2024

Guiding facial recognition across resolutions

This paper introduces a practical framework called DRGFER that can effectively recognize facial expressions in images of varying resolutions without compromising accuracy. It works by first determining the image resolution using a Resolution Recognition Network, then assigning images to specialized facial recognition networks based on their resolution, as directed by a ...

April 2024

Direct estimation of distortion flow for rolling shutter correction

This paper proposes a new method to correct rolling shutter distortion in images by directly estimating the distortion flow field from the underlying global shutter image to the distorted rolling shutter image. This avoids limitations of prior works that estimate undistortion flow and rely on complex scaling and warping. The key ideas are: 1) A global correlation attent...

April 2024

Adapting vision foundation models for stereo matching

This paper explores adapting vision foundation models (VFMs), which are adept at extracting informative visual features, for the task of stereo matching. The authors develop ViTAS, composed of modules for spatial differentiation and for aggregating stereo and contextual information into fine-grained features. When combined with stereo matching back-end processes in ViTASte...

April 2024

Category-level object pose learning without annotations

This paper proposes a method to learn a category-level 3D object pose estimator without requiring manually annotated pose data. Instead, it leverages generative diffusion models like Zero-1-to-3 to synthesize images of objects under controlled pose variations. To handle artifacts and noise, an image encoder learns pose features via contrastive learning. A novel strategy...

April 2024

Detecting Deepfakes from Videos

This paper proposes a novel deepfake video detection method. It leverages the image encoder from the CLIP foundation model to extract rich features. A specialized decoder with spatial and temporal modules is introduced to identify artifacts. Facial component guidance further enhances detection by focusing on key facial areas. Extensive experiments demonstrate superior c...

April 2024

Precise event spotting model for sports videos

This paper introduces T-DEED, a model designed to precisely spot events in sports videos. T-DEED features an encoder-decoder architecture to capture events requiring different temporal context, and integrates specialized layers to boost the distinctiveness of frame representations. This improves its ability to differentiate similar frames and make precise predictions. E...

April 2024

Automated facial landmark diagnosis for medical imaging

This paper proposes iterative strategies to refine automated labeling of facial landmarks, which are used to analyze medical images for conditions in specialties like dermatology and plastic surgery. By leveraging feedback loops and algorithms, initial landmark labels are iteratively improved to boost accuracy and reduce manual work. Evaluations demonstrate these strate...

April 2024

Deep learning for real-world video snapshot imaging

This paper presents a deep learning framework to address key challenges holding back real-world adoption of video snapshot imaging: limited dynamic range from temporal multiplexing, and algorithm performance degradation on real camera data. The authors propose a new structural mask enabling motion-aware, full dynamic range measurement. They also develop an efficient Tra...

April 2024

Codebook-based low-light image enhancement

This paper proposes CodeEnhance, a new approach to low-light image enhancement that leverages a codebook of high-quality image features as prior knowledge. It maps low-light images to discrete codebook indices, then decodes those indices using a pre-trained decoder from high-quality images. To improve mapping accuracy, CodeEnhance integrates semantic information and ada...

April 2024

Automated prompt learning for few-shot anomaly detection

This paper proposes a new method called PromptAD to automatically learn effective prompts for guiding few-shot anomaly detection models, without needing manual prompt engineering. It introduces techniques like semantic concatenation and explicit anomaly margins to enable prompt learning in the one-class setting of anomaly detection. PromptAD achieves state-of-the-art ac...

April 2024

Long-term multi-person pose forecasting

This paper proposes a model for forecasting human poses over longer timeframes and with more people interacting, using a coarse-to-fine approach. It forecasts global trajectories first, then conditions local pose forecasts on each trajectory mode. A graph module handles agent interactions for trajectory and pose prediction. To evaluate long-term multi-agent forecasting,...

April 2024

Improving visual prompt tuning via cross-layer connections

This paper proposes a new visual prompt tuning method called iVPT. It introduces cross-layer connections between prompt tokens in adjacent layers to enable better sharing of task-relevant information. A dynamic aggregation module selectively transfers helpful information between layers. An attentive reinforcement mechanism then uses these flexible attention weights to h...

April 2024

Learning depth prediction for 3D reconstruction from images

This paper proposes a new loss function and offset module to improve depth prediction accuracy in multi-view stereo networks. The adaptive Wasserstein loss measures divergence between predicted and ground truth depth distributions. The offset module yields continuous sub-pixel depth values. Together they achieve state-of-the-art multi-view reconstruction.
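
As a rough illustration of measuring divergence between depth distributions, the sketch below computes the plain one-dimensional Wasserstein-1 distance between two histograms over the same evenly spaced depth bins. In one dimension this equals the summed absolute difference of the two cumulative distributions, scaled by the bin width. The "adaptive" part of the paper's loss is not described in this summary and is not modeled here; `wasserstein_1d` and `bin_width` are hypothetical names.

```python
def wasserstein_1d(p, q, bin_width=1.0):
    """Wasserstein-1 distance between two histograms on identical, evenly spaced bins.

    Illustrative only: this is the textbook 1D formula, not the paper's adaptive loss.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("histograms must have positive total mass")

    cdf_gap = 0.0  # running difference between the two normalized CDFs
    total = 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i / sum_p - q_i / sum_q
        total += abs(cdf_gap)
    return total * bin_width


# All predicted mass sits two bins away from the ground-truth bin.
predicted = [0.0, 1.0, 0.0, 0.0]
ground_truth = [0.0, 0.0, 0.0, 1.0]
distance = wasserstein_1d(predicted, ground_truth)
```

Unlike a per-bin cross-entropy, this distance grows with how far the predicted mass lies from the true depth, which is a common motivation for Wasserstein-style depth losses.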

April 2024

Improving 3D Reconstruction from Single Images with Spatial Reasoning

This paper proposes a new method called KYN that improves 3D scene reconstruction from a single image, especially in occluded areas. It introduces two key innovations: (1) A vision-language module that injects semantic knowledge into 3D point representations. (2) A spatial attention mechanism that aggregates representations across the scene, making each point's density ...

April 2024

Improving 3D Understanding from Multiple Views

The authors introduce a system called SAP3D that can reconstruct 3D models and generate novel views of objects from an arbitrary number of input images. As more images are provided, SAP3D adapts its internal generative model to better match the specific object instance, improving reconstruction and view synthesis quality. This bridges the gap between single-image method...

April 2024

Efficient deep learning with convolutional neural networks

This paper explores how to improve the efficiency of convolutional neural networks (convnets) for computer vision. It finds that some commonly-used convnet building blocks have low computational efficiency when executed layer-by-layer. The author proposes fusing these layers into unified 'block-fusion' kernels to reduce memory usage and increase speed. A new convnet mod...

April 2024

Reducing background noise in attention maps for weakly supervised segmentation

This paper proposes a method to reduce background noise in attention maps used for weakly supervised semantic segmentation. The method enhances Class Activation Maps (CAMs) with attention maps from a Conformer model, and injects noise during training to further suppress background noise. Experiments show improved segmentation accuracy over prior methods on the PASCAL VOC and COCO datasets.

April 2024

Personalized interior design with AI agents

This paper presents I-Design, an AI system that allows users without design expertise to create customized 3D indoor scenes using natural language. I-Design uses a team of AI agents to transform user text input into feasible furniture arrangements represented as scene graphs. An algorithm then determines object placement, retrieves 3D assets, and visualizes the final de...

April 2024

ResNet and Attention for Ship Classification

This paper proposes a deep learning framework that integrates ResNet50 and the Convolutional Block Attention Module (CBAM) for accurate ship classification from optical satellite imagery. The model achieves 94% accuracy across 5 ship classes by focusing on salient image features. It has applications in maritime surveillance and illegal fishing detection.

April 2024

Designing Scalable Vision Models for Vision-Language

This paper benchmarks vision models like ViT, ConvNeXt, and CoAtNet on large-scale vision-language data. It examines data/model scalability, feature resolution, and hybrid architectures. This analysis motivates the proposal of ViTamin, a novel vision model tailored for vision-language models. ViTamin outperforms ViT significantly across over 60 downstream tasks while be...

April 2024

Detecting tree pith from wood slices

This paper introduces methods to automatically detect the pith (center point) in images of tree slices. The methods analyze the concentric ring structure and other visual patterns to optimize a function that identifies the central point. Three approaches are presented: 1) APD analyzes local orientation of rings/patterns, 2) APD-PCL enhances APD for cases without clear r...

April 2024

Direct generation of co-speech gesture videos

This paper presents a new framework to directly generate realistic and temporally aligned co-speech gesture videos from speech audio. It introduces a nonlinear transformation to obtain compact yet descriptive motion features, and uses a transformer-based diffusion model to capture the relationships between gestures and speech. An optimal motion selection method enables ...

March 2024

Learning Temporal Dynamics in Scientific Models

This paper proposes SineNet, a neural network architecture for modeling complex dynamics in time-dependent partial differential equations. SineNet is composed of multiple U-Net blocks that progressively evolve high-resolution features over time, reducing misalignment issues in conventional U-Nets. An analysis shows SineNet leverages skip connections for both parallel an...

March 2024

Fixing Mask Boundaries and Exhaustiveness in COCO

The authors inspect COCO's instance masks and find issues like imprecise boundaries, missing instances, and mislabeled masks. They refine all masks and ensure exhaustive instance annotations across images to create a cleaned version called COCO-ReM. Experiments show that COCO-ReM enables more accurate benchmarking - models are no longer incorrectly penalized for predict...

March 2024

Gaussian Splatting for Face Reconstruction

This paper introduces SplatFace, a method to reconstruct 3D human faces from a small number of input images using Gaussian splatting. It incorporates a morphable face model surface to guide splat placement and orientation. Key innovations are a splat-to-surface distance metric and world-space densification process.

March 2024

Enhancing line search for neural networks

This paper improves the Armijo line search method by integrating momentum from the Adam optimizer. This enables efficient large-scale neural network training on complex datasets, outperforming previous Armijo implementations and tuned Adam learning rates. Evaluations use transformers for NLP and CNNs for image data.
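
For readers unfamiliar with the baseline being improved, here is a minimal sketch of a classic backtracking Armijo line search along the plain negative gradient. It does not include the Adam momentum integration the paper proposes. The function name `armijo_step` and the constants `c` and `shrink` are illustrative assumptions.

```python
def armijo_step(f, grad, x, step=1.0, c=1e-4, shrink=0.5, max_iter=50):
    """Take one gradient-descent step whose size satisfies the Armijo condition.

    Illustrative only: plain backtracking, without the paper's momentum term.
    Returns the new point and the accepted step size.
    """
    g = grad(x)
    fx = f(x)
    g_sq = sum(g_i * g_i for g_i in g)  # squared gradient norm

    for _ in range(max_iter):
        candidate = [x_i - step * g_i for x_i, g_i in zip(x, g)]
        # Armijo condition: require a sufficient decrease proportional to the step.
        if f(candidate) <= fx - c * step * g_sq:
            return candidate, step
        step *= shrink  # step too large, shrink it and retry

    return x, 0.0  # no acceptable step found


def quadratic(x):
    return sum(x_i * x_i for x_i in x)


def quadratic_grad(x):
    return [2.0 * x_i for x_i in x]


start = [3.0, -4.0]
new_x, accepted_step = armijo_step(quadratic, quadratic_grad, start)
```

On this quadratic the initial step of 1.0 overshoots to the mirror point and is rejected, and the halved step of 0.5 lands on the minimum.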

March 2024

Detecting out-of-distribution objects in neural networks

This paper proposes a method called Box Abstraction Monitors (BAM) to detect out-of-distribution objects in neural networks for object detection. BAM uses simple box shapes fitted to the features of in-distribution data. At inference time, features of detected objects that fall outside these boxes are flagged as out-of-distribution. BAM is integrated into Faster R-CNN m...
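
The box idea described above can be sketched in a few lines, under the simplifying assumption of a single axis-aligned box fitted to all in-distribution feature vectors. The source mentions boxes in the plural, so the actual method presumably fits several. The names `fit_box`, `is_out_of_distribution`, and `margin` are hypothetical.

```python
def fit_box(features):
    """Fit one axis-aligned box (per-dimension min and max) to feature vectors.

    Illustrative only: a single box, whereas the paper refers to multiple boxes.
    """
    dims = range(len(features[0]))
    lows = [min(f[d] for f in features) for d in dims]
    highs = [max(f[d] for f in features) for d in dims]
    return lows, highs


def is_out_of_distribution(feature, box, margin=0.0):
    """Flag a feature vector that falls outside the box in any dimension."""
    lows, highs = box
    return any(
        value < low - margin or value > high + margin
        for value, low, high in zip(feature, lows, highs)
    )


# Toy two-dimensional "features" standing in for detector features.
in_distribution = [(0.0, 1.0), (2.0, 3.0), (1.0, 2.0)]
box = fit_box(in_distribution)

inside = is_out_of_distribution((1.5, 2.5), box)
outside = is_out_of_distribution((5.0, 2.0), box)
```

The check at inference time is only a handful of comparisons per detected object, which is what makes this kind of monitor cheap to attach to a detector.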

March 2024

Efficient segmentation of objects in long videos

This paper proposes a transformer-based approach called MAVOS that uses an optimized long-term memory bank to accurately segment objects in long videos. It introduces a dynamic modulated cross-attention memory that effectively encodes useful features over time while maintaining consistent speed and low GPU memory usage. Extensive experiments show MAVOS achieves real-tim...

March 2024

Recovering human pose and shape from images without detection

This paper introduces a new approach called AiOS that recovers full 3D human pose and shape from images without needing a separate human detection step. It builds on the DETR object detection architecture and treats human recovery as a progressive set prediction task, using different token types to encode global human features and local joint features. Outperforming pre...

March 2024

Comparing multimodal models to vision transformers for security tasks

This paper investigates whether large multimodal models (LMMs) like Gemini can match the performance of specialized vision transformer models on security tasks requiring image analysis. The authors test Gemini and vision transformers on two tasks: detecting adversarial image triggers and visually classifying malware types. They find vision transformers significantly out...

March 2024

Detecting invisible gas leaks from images

This paper introduces a computer vision technique to detect invisible gas leaks using both regular RGB images and thermal infrared images. It also provides a new dataset called Gas-DB to facilitate further research. The proposed method outperforms previous techniques, accurately identifying gas regions by combining information from the two image types.

March 2024

Improving non-hierarchical visual Mamba

This paper proposes PlainMamba, a simplified non-hierarchical version of the Mamba state space model for visual recognition. It adapts Mamba's selective scanning to 2D images through continuous scanning and direction-aware updating. Results show gains over previous non-hierarchical models, and competitiveness with hierarchical models, especially for high-resolution inpu...

March 2024

Using diffusion models to plan views for unknown object reconstruction

This paper proposes an approach to efficiently plan camera views for reconstructing initially unknown objects, such as household items. It uses a diffusion model to generate a 3D model from a single photo, then plans optimal views around that model to capture images and reconstruct the real object with high quality. The method intelligently adapts the...

March 2024

Detecting AI-Generated Videos

This paper proposes a method called AIGVDet to detect videos that are generated by AI systems. It uses a two-branch convolutional neural network to analyze spatial pixel anomalies in frames and temporal inconsistencies in optical flow. A new benchmark dataset called GVD was constructed containing over 11,000 AI-generated videos. Experiments showed AIGVDet has much highe...

March 2024

Camera-aware clustering refines labels for unsupervised person re-identification

This paper introduces a camera-aware label refinement framework to address challenges in unsupervised person re-identification. It employs reliable intra-camera clustering to refine noisy global clustering, enabling more effective self-paced discriminative model training. A camera alignment module is also proposed to reduce feature distribution discrepancies across came...

March 2024

Multi-attention network for visual tracking

This paper proposes a multi-attention associate prediction network (MAPNet) for visual tracking. It designs two feature matchers that integrate different attention mechanisms: a category-aware matcher that captures category semantics for classification, and a spatial-aware matcher that captures spatial contexts for regression. A dual alignment module enhances correspondence betw...

March 2024

Video editing by propagating image edits

The Videoshop algorithm enables localized, semantic video editing by allowing users to modify the first frame of a video using any image editing tool. It then automatically propagates those pixel-level changes to all subsequent frames, ensuring coherent object motions and fidelity to the user's edits across the video sequence.

March 2024

Hierarchical text and image alignment for histopathology representation learning

This paper proposes a new self-supervised learning framework called HLSS that aligns hierarchical natural language descriptions with visual features in histopathology images across patient, slide, and patch levels. This helps the model learn improved representations that achieve state-of-the-art performance on downstream tasks and provide better interpretability.

March 2024

Explaining Vision Transformers via Token Transformation

This paper proposes TokenTM, a new method to explain predictions from Vision Transformer models. It focuses on quantifying the impact of token transformations within the model, alongside attention weights. Specifically, TokenTM measures changes in token vector lengths and directions before and after transformation modules. It then aggregates these effects across all lay...
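The two quantities named in the summary above, change in token length and change in token direction, can be illustrated as follows. This is a sketch under assumptions: length change is taken as the ratio of L2 norms and direction change as cosine similarity, and the function name is hypothetical. The summary does not say how TokenTM defines or combines these measurements with attention weights.

```python
import math
from typing import Sequence, Tuple


def length_and_direction_change(
    before: Sequence[float], after: Sequence[float]
) -> Tuple[float, float]:
    """Compare one token vector before and after a transformation module.

    Returns (length_ratio, cosine_similarity). Both inputs must be non-zero
    vectors, otherwise the divisions below fail.
    """
    norm_before = math.sqrt(sum(x * x for x in before))
    norm_after = math.sqrt(sum(x * x for x in after))
    dot = sum(a * b for a, b in zip(before, after))
    length_ratio = norm_after / norm_before
    cosine = dot / (norm_before * norm_after)
    return length_ratio, cosine


# A token that is scaled but not rotated: length doubles, direction is unchanged.
scaled = length_and_direction_change([3.0, 4.0], [6.0, 8.0])

# A token that is rotated by 90 degrees and doubled in length.
rotated = length_and_direction_change([1.0, 0.0], [0.0, 2.0])
print(scaled, rotated)
```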

March 2024

Impact of data diversity on self-supervised learning

This paper explores how training self-supervised learning models on more diverse datasets impacts performance. The key finding is that more diversity helps, but only when the distribution of the training data matches the end task. Even very diverse web data or AI-generated data does not offset a distribution mismatch. The experiments cover 7 methods over 200 GPU days.

March 2024

Assessing diving performance with computer vision

This paper introduces a neuro-symbolic system to evaluate the quality of platform dives using computer vision techniques. It combines neural networks, which extract information like diver poses and splashes, with rule-based analysis that mimics how human experts score dives. This provides more objective and transparent scoring than existing methods reliant on subjective...

March 2024

Unified facial analysis model

This paper introduces FaceXformer, the first transformer-based model capable of handling multiple facial analysis tasks like parsing, landmark detection, pose estimation, attribute recognition, age/gender/race estimation, and landmark visibility in a single framework. It proposes a parameter-efficient decoder that processes face and task tokens together to learn robust ...

March 2024

Recovering 3D Human Models and Camera Poses from Video

This paper introduces a new method called WHAC to estimate 3D human models and camera poses over time from monocular video, without needing complex optimization techniques. It leverages two key insights: existing human pose methods can estimate accurate depth, and human motions provide spatial cues. WHAC outperforms prior work on benchmark datasets.

March 2024

Scene dataset for surface prediction

The paper introduces MASSTAR, a large-scale, multi-modal dataset to advance research in surface prediction and completion for complex 3D scenes. It contains over 1000 scene-level models and corresponding rendered images, texts, and point clouds. The key innovation is an efficient toolchain that screens high-quality models from raw 3D data and generates multi-modal infor...

March 2024

Refining 3D human pose estimates

This paper proposes a method to refine initial 3D human pose estimates from images. It learns to predict dense 2D displacements between renderings of the initial 3D model and the image. These displacements are used to optimize the model to better align with image evidence, improving 3D accuracy. Experiments on 3DPW and RICH datasets demonstrate consistent improvements i...

March 2024

Lightweight models for facial emotion analysis

This paper introduces lightweight neural network models like MobileViT and MobileFaceNet that are trained to recognize facial expressions, valence, and arousal in photos. These models extract features that are fed into simple classifiers to predict emotion intensity, compound expressions, action units, expressions, and valence/arousal for video frames. The models signif...

March 2024

Memory-efficient basketball player segmentation

This paper introduces a framework called MISS that uses prior visual knowledge of basketball games to improve instance segmentation with limited data. It leverages information like court layout and team uniforms to optimize data preprocessing, augmentation, training and inference. Evaluations show MISS performs well in low-data, constrained-memory situations without hur...

March 2024

Efficient player detection on basketball courts

This paper proposes a computer vision model to accurately detect players, referees, coaches, and balls on a basketball court, while being highly efficient in terms of computation and memory usage. It incorporates prior knowledge of the basketball domain into data preprocessing, augmentation, and model inference. Despite tight data and resource constraints, ...

March 2024

Refinement of robot object detection via proposal adaptation

This paper introduces a scalable approach to adapt object detectors in cloud robotics when robots rely on pre-trained models that degrade in new environments. A lightweight neural network called R2SNet runs locally, refining proposals by relabeling, rescoring and suppressing boxes to mitigate performance drops. Evaluated on mobile robots detecting doors, it improved acc...

March 2024

Photo-realistic image synthesis guided by semantic masks

This paper introduces SCP-Diff, a method to generate highly realistic and diverse images that accurately follow provided semantic masks. It builds on ControlNet by addressing issues around mismatch between training and inference stages. Through spatial, categorical, and joint priors, SCP-Diff aligns the distribution of noise during inference closer to that seen during t...

March 2024

Local scanning in vision models

This paper introduces a local scanning technique to improve how vision models capture spatial relationships in images. It divides images into distinct windows that are scanned individually before traversing across windows. This approach better preserves local 2D dependencies compared to flattening all spatial tokens. The paper also proposes adaptively searching for opti...
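The difference between window-by-window scanning and plain flattening, as described above, is easiest to see in the order tokens are visited. The sketch below is illustrative only: it assumes square, non-overlapping windows and row-major order both inside and across windows, and the function names are hypothetical. The adaptive search for scan settings mentioned in the summary is not covered.

```python
from typing import List, Tuple

Position = Tuple[int, int]  # (row, column) of a token in the 2D grid


def raster_scan_order(height: int, width: int) -> List[Position]:
    """Flatten the grid row by row, ignoring any window structure."""
    return [(r, c) for r in range(height) for c in range(width)]


def local_scan_order(height: int, width: int, window: int) -> List[Position]:
    """Visit every token of one window before moving on to the next window."""
    order: List[Position] = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            # Clip the window at the grid border when sizes do not divide evenly.
            for r in range(top, min(top + window, height)):
                for c in range(left, min(left + window, width)):
                    order.append((r, c))
    return order


flat = raster_scan_order(4, 4)
local = local_scan_order(4, 4, window=2)
print(flat[:4])   # first row of the grid
print(local[:4])  # first 2x2 window
```

In the flattened order, the token directly below `(0, 0)` is four steps away; in the windowed order it is two steps away, which is the sense in which local 2D neighbourhoods stay closer together in the sequence.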

March 2024

Reconstructing 3D Scenes from Sparse Camera Views

This paper introduces a method called 3DFIRES that can reconstruct complete 3D models of complex scenes, including surfaces not visible in the input images. It works with as few as one image, but additional views enable more accurate and consistent reconstructions. Key to its approach is fusing information across images in feature space before predicting geometry.

March 2024

Text-guided editing of 3D Gaussian Splatting scenes

This paper proposes GaussCtrl, a method to edit 3D Gaussian Splatting reconstructions of scenes using natural language instructions. Key innovations enable editing multiple rendered views together, rather than iteratively, leading to greater speed and visual quality. Consistency is achieved via depth-guided editing and an attention mechanism over latent image codes.

March 2024

Vectorizing Historical Astronomy Diagrams

This paper introduces a dataset of 303 annotated historical astronomy diagrams and a transformer model to detect geometric primitives like lines, circles and arcs in these complex, deteriorated drawings. The model uses synthetic data and outperforms previous methods.

March 2024

Reconstructing 3D Scenes from Refractive Underwater Images

This paper presents a complete structure-from-motion system to reconstruct underwater 3D scenes from images captured through refractive camera housings. It robustly handles various housing types and integrates physical refraction modeling into the open-source COLMAP software. Evaluations validate accuracy without compromising robustness.

March 2024

Pig Aggression Detection with Neural Networks

This paper explores techniques to automatically detect aggressive behavior in pigs from videos, to assist farmers and reduce errors. It compares convolutional neural networks and transformer models like TimeSformer on a new dataset. The TimeSformer architecture was most effective, using attention to focus on key regions in frames over time. With computer vision, aggress...

March 2024

Sparse tuning of vision models for few-shot learning

This paper introduces a new method called Sparse MetA-Tuning (SMAT) to enhance the few-shot learning abilities of vision models. SMAT adds a secondary optimization stage after pre-training, similar to meta-tuning. But unlike standard meta-tuning, SMAT isolates subsets of parameters through sparsity for tuning on each task. This avoids interference between tasks. SMAT es...

The history of computer vision